Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
Free, publicly-accessible full text available July 27, 2026.
-
Language models are optimized to learn which responses you prefer, but they don't learn why you preferred a particular response. This limits their ability to tailor to personalized requests (e.g., "What should I eat for dinner? I'm vegetarian"), so we introduce a simple fix: have models infer personas that explain why users could prefer responses. We show training on these inferred personas leads to responses that are significantly more personalized for user needs.
Free, publicly-accessible full text available July 27, 2026.
-
Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs
Free, publicly-accessible full text available July 1, 2026.
-
Language models like ChatGPT are pretty good at answering questions (e.g., "What is 12 * 12?"), but we show they can surprisingly struggle when asked to do the reverse task: generating questions for answers (e.g., "Give me a question with the answer 144"). We study when these errors happen, what might be causing them, and how they can be addressed.
Free, publicly-accessible full text available January 1, 2026.
-
Language models have shown great promise in commonsense-related tasks. However, it remains unclear how they would perform in the context of physically situated human-robot interactions, particularly in disaster-relief scenarios. In this paper, we develop a language model evaluation dataset with more than 800 cloze sentences, written to probe for the function of over 200 objects. The sentences are divided into two tasks: an "easy" task where the language model has to choose between vocabulary with different functions (Task 1), and a "challenge" task where it has to choose between vocabulary with the same function, yet only one vocabulary item is appropriate given real-world constraints on functionality (Task 2). DistilBERT performs at about 80% accuracy on both tasks. To investigate how annotator variability affected those results, we developed a follow-on experiment in which we compared our original results with wrong answers chosen based on embedding-vector distances. Those results showed increased precision across documents but a 15% decrease in accuracy. We conclude that language models do have a strong knowledge base for object reasoning, but will require creative fine-tuning strategies in order to be successfully deployed.
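The cloze-based evaluation described above can be sketched as a small scoring loop: present a sentence with a masked slot, score each candidate filler, and count how often the top-scoring candidate is the correct one. The item fields, the toy items, and the stand-in scorer below are illustrative assumptions, not the paper's dataset or code; in practice the scorer would be a masked language model such as DistilBERT.

```python
# Minimal sketch of the two-task cloze evaluation protocol.
# Item schema and score_fn are hypothetical; a real run would use
# DistilBERT's masked-LM probabilities as the scorer.

def evaluate_cloze(items, score_fn):
    """Accuracy of a scorer on cloze items.

    Each item has a sentence with a [MASK] slot, candidate fillers,
    and the single correct filler. score_fn(sentence, candidate)
    returns a plausibility score (higher is better).
    """
    correct = 0
    for item in items:
        prediction = max(item["candidates"],
                         key=lambda c: score_fn(item["sentence"], c))
        correct += prediction == item["answer"]
    return correct / len(items)

# Toy stand-ins for the "easy" (different-function) setting:
items = [
    {"sentence": "Use the [MASK] to cut the rope.",
     "candidates": ["knife", "pillow"], "answer": "knife"},
    {"sentence": "Sleep on the [MASK] tonight.",
     "candidates": ["axe", "mattress"], "answer": "mattress"},
]

def toy_score(sentence, candidate):
    # Trivial keyword heuristic standing in for a masked-LM score.
    cues = {"cut": "knife", "Sleep": "mattress"}
    return sum(candidate == v for k, v in cues.items() if k in sentence)

print(evaluate_cloze(items, toy_score))  # -> 1.0
```

The "challenge" setting (Task 2) uses the same loop; only the candidate sets change, so that all candidates share a function and the scorer must rely on real-world constraints to prefer the correct one.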
-
We investigate neural models' ability to capture lexicosyntactic inferences: inferences triggered by the interaction of lexical and syntactic information. We take the task of event factuality prediction as a case study and build a factuality judgment dataset for all English clause-embedding verbs in various syntactic contexts. We use this dataset, which we make publicly available, to probe the behavior of current state-of-the-art neural systems, showing that these systems make certain systematic errors that are clearly visible through the lens of factuality prediction.
-
We present a large-scale collection of diverse natural language inference (NLI) datasets that help provide insight into how well a sentence representation captures distinct types of reasoning. The collection results from recasting 13 existing datasets from 7 semantic phenomena into a common NLI structure, resulting in over half a million labeled context-hypothesis pairs in total. We refer to our collection as the DNC: Diverse Natural Language Inference Collection. The DNC is available online at https://www.decomp.net, and will grow over time as additional resources are recast and added from novel sources.
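Recasting an existing labeled dataset into the common NLI structure amounts to mapping each source example to a context-hypothesis pair with an inference label. The sketch below illustrates the idea for a hypothetical binary sentiment source task; the field names, hypothesis template, and label strings are assumptions for illustration, not the DNC's actual schema.

```python
# Illustrative recasting of a sentiment example into an NLI pair.
# Hypothesis wording and labels are hypothetical, not the DNC format.

def recast_sentiment(sentence, label):
    """Map a binary sentiment example to a context-hypothesis pair.

    label is 'positive' or 'negative'. The hypothesis asserts positive
    sentiment, so it is entailed exactly when the source label agrees.
    """
    hypothesis = "The author of the text likes what they are describing."
    nli_label = "entailed" if label == "positive" else "not-entailed"
    return {"context": sentence, "hypothesis": hypothesis, "label": nli_label}

pair = recast_sentiment("The soup was delicious.", "positive")
print(pair["label"])  # -> entailed
```

Each of the 13 source datasets would get its own such mapping, so a single NLI model can be probed for all 7 semantic phenomena through one uniform interface.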